[core][scalability] Change ray syncer from unary call to streaming call #30460
Conversation
Is this ready to review? Btw, before merging more advanced features like this, should we turn on the syncer by default first? Or is this necessary for that step?
@rkooo567 No, we shouldn't turn on the syncer by default yet. Going step by step was my original plan, but I realized that even the unary syncer alone is a big feature that involves a lot of testing. Besides, handling failures with the unary RPC is very hard, since a lot of state needs to be kept. So I'd prefer a longer testing period and a slow rollout. The streaming approach is also simpler than the unary one, and thus should be easier to maintain; I don't think we should spend time testing and staging this feature twice. Thanks to the abstraction we made, most of the changes are in the communication layer, so you'll see that only the sending/receiving and connecting/disconnecting logic changed. Also, there is no intention to keep the protocol exactly the same as before (it's still very similar), since we changed the RPC from unary to streaming. We'll fix issues as we do the testing. Btw, I think the callback API is just hard to use correctly without a lot of (threading) knowledge. After this, I'll try to find a good gRPC framework we can adopt and use it in the ray syncer.
Thanks for cleaning it up! You might want @rkooo567 to take another look as well.
/// disconnect from the remote node.
/// For the implementation, in the constructor, it needs to connect to the remote
/// node and it needs to implement the communication between the two nodes.
class RaySyncerBidiReactor {
Looks much cleaner. Since RaySyncerBidiReactor is not doing too much, would it make sense to merge RaySyncerBidiReactor into RaySyncerBidiReactorBase?
Because the server reactor and the client reactor are two different types, and RaySyncer maintains them the same way (it doesn't care about server vs. client), they are all stored in one map. You can think of this class more as an interface class.
Are you saying it is to keep RaySyncerBidiReactorBase<Server> and RaySyncerBidiReactorBase<Client> in the same map (as the same type)?
Yes, because from the RaySyncer class's point of view, it's the same logic.
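To illustrate the design being discussed, here is a minimal sketch of the pattern; only RaySyncerBidiReactor, RaySyncerBidiReactorBase, GetRemoteNodeID, and Disconnect come from this PR, while the reactor stand-ins and map layout are assumptions:

#include <map>
#include <string>
#include <utility>

// Interface class: RaySyncer stores every connection behind this type,
// whether it is the server side or the client side of the stream.
class RaySyncerBidiReactor {
 public:
  virtual ~RaySyncerBidiReactor() = default;

  /// Return the remote node id of this connection.
  const std::string &GetRemoteNodeID() const { return remote_node_id_; }

  virtual void Disconnect() = 0;

 protected:
  explicit RaySyncerBidiReactor(std::string remote_node_id)
      : remote_node_id_(std::move(remote_node_id)) {}

  std::string remote_node_id_;
};

// Hypothetical stand-ins for grpc::ServerBidiReactor / grpc::ClientBidiReactor.
struct ServerReactor {};
struct ClientReactor {};

// Shared send/receive logic lives in the base template; T picks the gRPC side.
template <typename T>
class RaySyncerBidiReactorBase : public RaySyncerBidiReactor, public T {
 protected:
  using RaySyncerBidiReactor::RaySyncerBidiReactor;
};

// Because both instantiations share the interface, RaySyncer can keep them
// in a single map without caring which side of the stream each one is:
std::map<std::string, RaySyncerBidiReactor *> sync_reactors_;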
Maybe we should make the reactor thread-safe? I saw it is used both within the io service and outside it. (Also, the threading model is getting more confusing. Should I assume everything runs in the same thread, or in separate threads?)
/// Return the remote node id of this connection.
const std::string &GetRemoteNodeID() const { return remote_node_id_; }

virtual void Disconnect() = 0;
Can you add a docstring to explain what this method is supposed to do?
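For example, a docstring along these lines (the wording is only a suggestion, not from the PR):

/// Disconnect this reactor from the remote node: stop accepting new
/// messages for sending and let the underlying gRPC call finish. Once
/// gRPC delivers its final callback, the cleanup callback runs and the
/// reactor is destroyed, optionally triggering a reconnect.
virtual void Disconnect() = 0;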
bool PushToSendingQueue(std::shared_ptr<const RaySyncMessage> message) override {
  // Try to filter out the messages the target node already has.
  // Usually it'll be the case when the message is generated from the
Can you add a comment with a real-world example? That might make it easier to understand.
This is not meant to handle a real-world failure case. It's just an optimization to avoid sending a message back to the same node it came from.
Oh, I meant the comment would be easier to understand with an example (e.g., a resource update received from GCS doesn't need to be resent).
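A sketch of how such a comment plus the filter could read; the message shape and the node_versions_ bookkeeping here are assumptions for illustration, not the PR's actual fields:

#include <cstdint>
#include <deque>
#include <map>
#include <memory>
#include <string>

// Hypothetical message shape, for illustration only.
struct RaySyncMessage {
  std::string node_id;  // the node whose state this message describes
  int64_t version = 0;  // monotonically increasing per source node
};

class SendingQueueSketch {
 public:
  bool PushToSendingQueue(std::shared_ptr<const RaySyncMessage> message) {
    // The optimization under discussion: a message authored by (or just
    // received from) the remote node never needs to be sent back to it,
    // e.g. a resource update received from GCS is not resent to GCS.
    if (message->node_id == remote_node_id_) {
      return false;
    }
    // Drop anything not newer than what the remote node has already seen.
    auto &seen_version = node_versions_[message->node_id];
    if (message->version <= seen_version) {
      return false;
    }
    seen_version = message->version;
    sending_queue_.push_back(std::move(message));
    return true;
  }

 private:
  std::string remote_node_id_;
  std::map<std::string, int64_t> node_versions_;
  std::deque<std::shared_ptr<const RaySyncMessage>> sending_queue_;
};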
io_context_.dispatch([this]() { SendNext(); }, "");
} else {
  // No need to resend the message: if ok=false, it's the end
  // of the gRPC call and we'll reconnect in case of a failure.
Can you also explain what will reconnect, instead of "we"? The caller? The client-side syncer?
Also, briefly explain which callback will be called from gRPC (looks like it is OnDone?).
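For reference, the relevant hooks in gRPC's callback API look roughly like this; grpc::ClientBidiReactor and the OnWriteDone/OnDone signatures are gRPC's, while the Request/Response types and the queued-send logic are placeholders:

#include <grpcpp/grpcpp.h>
#include <grpcpp/support/client_callback.h>

struct Request {};   // placeholder message types
struct Response {};

class SyncReactorSketch : public grpc::ClientBidiReactor<Request, Response> {
  void OnWriteDone(bool ok) override {
    // ok == true: the write finished; schedule sending the next message.
    // ok == false: the call is shutting down; no further writes will
    // succeed, and gRPC will invoke OnDone with the final status.
    if (ok) {
      // io_context_.dispatch([this]() { SendNext(); }, "");
    }
  }

  void OnDone(const grpc::Status &status) override {
    // Terminal callback for the whole call: gRPC guarantees no further
    // callbacks after this, so cleanup (and any reconnect) starts here.
    (void)status;
  }
};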
io_context_,
[this](auto msg) { BroadcastRaySyncMessage(msg); },
[this, channel](const std::string &node_id, bool restart) {
  sync_reactors_.erase(node_id);
No need to call Disconnect?
No. This is the cleanup callback, so it means Disconnect has already been called.
if (restart) {
  RAY_LOG(INFO) << "Connection is broken. Reconnect to node: "
                << NodeID::FromBinary(node_id);
  Connect(node_id, channel);
If the failure is because the channel was closed from the server side, will this recover it? No need to create a new channel?
The channel will reconnect by itself, and if it keeps failing for a long time, the process will crash by Ray's design. There is no need to recreate the channel.
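For context, this relies on gRPC channels redialing on their own, so the same channel object can be handed back to Connect in the cleanup callback. A sketch of creating such a reusable channel; the keepalive tuning is an assumption, not from this PR:

#include <grpcpp/grpcpp.h>
#include <memory>
#include <string>

std::shared_ptr<grpc::Channel> MakeSyncChannel(const std::string &address) {
  grpc::ChannelArguments args;
  // Hypothetical tuning so a broken link is noticed promptly; the channel's
  // connectivity state machine then re-establishes the connection itself.
  args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, 10000);
  return grpc::CreateCustomChannel(
      address, grpc::InsecureChannelCredentials(), args);
}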
  for (const auto &[_, messages] : node_state_->GetClusterView()) {
-   for (auto &message : messages) {
+   for (const auto &message : messages) {
Can we change the name of the API from GetClusterView to GetPendingMessages or something? Iterating messages from the cluster view sounds confusing.
This actually is fetching the ClusterView. It's a new node, and we need to send it the snapshot.
Hmm, the terminology is a bit confusing to me, but that's probably because I am not very familiar with the syncer ("iterating messages returned from ClusterView" sounds a bit weird to me).
ClusterView is a map from node -> status of the node.
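Roughly, the shape under discussion; the aliases and the fixed-size array of components are assumptions sketched from the conversation:

#include <array>
#include <map>
#include <memory>
#include <string>

struct RaySyncMessage;  // latest state of one synced component on a node

// Node id -> the newest message per synced component (e.g. resource load).
using ClusterView =
    std::map<std::string,
             std::array<std::shared_ptr<const RaySyncMessage>, 2>>;

// "Iterating messages from the cluster view": when a new node connects,
// walk the view and push every known message to it as a snapshot.
void SendSnapshot(const ClusterView &view) {
  for (const auto &[node_id, messages] : view) {
    for (const auto &message : messages) {
      if (message != nullptr) {
        // PushToSendingQueue(message);
      }
    }
  }
}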
}

void RaySyncer::Disconnect(const std::string &node_id) {
  std::promise<RaySyncerBidiReactor *> promise;
This promise-and-block pattern seems to be used frequently. Is there any way to make it a general method? Something like postBlocking()?
Maybe something like io_context.sync_run()? I don't want to add it here; I'll try to include it in the threading work.
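A minimal sketch of what such a helper could look like, assuming a Boost.Asio io_context; the name SyncRun and the non-void-only restriction are choices of this sketch:

#include <future>
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

// Run `fn` on the io_context thread and block the caller until it returns.
// Caution: calling this from the io_context thread itself would deadlock.
template <typename Fn>
auto SyncRun(boost::asio::io_context &io, Fn fn) -> decltype(fn()) {
  std::promise<decltype(fn())> promise;  // non-void return types only
  boost::asio::post(io, [&] { promise.set_value(fn()); });
  return promise.get_future().get();
}

// Usage mirroring RaySyncer::Disconnect below:
//   auto *reactor = SyncRun(io_context_, [&] { return FindReactor(node_id); });
//   if (reactor != nullptr) reactor->Disconnect();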
},
"RaySyncerDisconnect");
auto reactor = promise.get_future().get();
if (reactor != nullptr) {
  reactor->Disconnect();
Doesn't it need to be protected?
It's not necessary to protect this. But I'll make sure all public methods run in the io context.
@rkooo567 I agree with you on this. The root cause is that both the gRPC thread and the io thread call some methods there. I think we need a wrapper around gRPC to fix it.
LGTM. It would be great if we had the fault-tolerance semantics documented somewhere.
Also, my main concern is that the threading model is confusing and hard to understand (though it is the same across Ray 😞). I guess wrapping gRPC will make things clearer, so we should do it asap!
io_context_.dispatch([this]() { SendNext(); }, "");
} else {
  RAY_LOG_EVERY_N(ERROR, 100)
      << "Failed to send the message to: " << NodeID::FromBinary(GetRemoteNodeID());
Is it "resent" because we will recreate the connection and the message will be sent from there?
Threading is a big thing and will be the next step. This might not be the final solution we choose; we'll see. But it will be the next step after lightweight resource broadcasting is enabled.
Yes!
Merged master into the PR.
Before merging, should we do the following?
@rkooo567 I've tested them (CI: https://buildkite.com/ray-project/oss-ci-build-pr/builds/10223#_). A GCS FT-related test failed; I'm not sure of the root cause. Overall it's OK. Given this is a big PR changing the communication layers, I plan to look into the CI test failure quickly, and if it's hard, fix it later.
The GCS FT test is fixed.
All tests passed when the flag is on: https://buildkite.com/ray-project/oss-ci-build-pr/builds/10260
Why are these changes needed?
With the unary calls, it's hard to make resource broadcasting fault tolerant, since the status needs to be maintained.
This PR updates the communication protocol to streaming.
There are several things changed in the protocol:
The PR has been tested with 2k nodes (2 CPUs per node) and 14k actors.
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.